Linear regression is a means to test a relationship between one or more independent variables (i.e., manipulated variables) and one dependent variable (response variable). Depending on the nature of the independent variable(s), you may be fitting a model called a linear regression, ANOVA, or ANCOVA. We will discuss their differences and similarities. These models also have specific assumptions that must be met for the estimated relationships to be unbiased and useful for making predictions and interpreting the results.
The ANOVA model is a class of linear models where the response variable is quantitative (continuous) and the predictors are one or more nominal (categorical) variables. The term ANOVA comes from Analysis Of Variance, a method for partitioning the variance of a group of observations that was invented by Ronald Fisher over a century ago.
ANOVA is further categorized by the number of nominal variables included in the analysis. A one-way ANOVA has a single predictor. The following example shows the important sequential steps that should be used to complete a one-way ANOVA.
By reviewing the data and its characteristics you can confirm the appropriate analysis and understand the context of the hypothesis. This will also be helpful when considering how to visualize the data to help interpret the results.
This example is a pot experiment comparing the weed control efficacy of two herbicides used alone and in mixture. An untreated control was included as a reference. The four treatments are 1) Metribuzin, 2) Rimsulfuron, 3) Metribuzin+Rimsulfuron, and 4) Untreated Control.
Sixteen uniform pots were prepared and sown with Solanum nigrum. When the plants reached the 4-true-leaves stage they were randomly sprayed with one of the above herbicide solutions following a completely randomized design (will be covered in more detail during the Experimental Design lecture) with four replicates. Three weeks after the treatment, the plants in each pot were harvested and weighed. The theory dictates that the lower the weight the higher the efficacy of herbicides. Our hypothesis was that there was a difference in the response (i.e., weight) among the different treatments. Identifying a treatment that produces the lowest weight would suggest better control of weed growth.
When organizing data (commonly in an Excel file), note that the dataset is in a “tidy” format, with one row per observation and one column per variable. This is known as the long format and is what most R functions expect. Other data formats might be useful for some visualizations, but the data can be transformed into other formats when necessary.
## Rows: 16 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Treat
## dbl (1): Weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
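The output above comes from reading the CSV file with readr's read_csv; a minimal sketch (the file name here is an assumption for illustration):

```r
library(readr)

# Read the herbicide data; the file name is assumed for this example
One_Way_Anova <- read_csv("One_Way_Anova.csv")
```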
The summary function applied to the data object produces summary statistics for each variable in the dataset. We can also use the tapply function to produce the mean and standard deviation for each treatment level; the same can be done with the dplyr function summarise. This gives a glimpse of the treatment averages and how the standard deviation varies between treatments. We observed differences in the sample means, but our goal is to determine whether there are differences in the population means.
## Treat Weight
## Length:16 Min. : 1.950
## Class :character 1st Qu.: 6.635
## Mode :character Median :12.850
## Mean :14.484
## 3rd Qu.:21.560
## Max. :30.940
treatMeans <- tapply(One_Way_Anova$Weight, One_Way_Anova$Treat, mean)
SDs <- tapply(One_Way_Anova$Weight, One_Way_Anova$Treat, sd)
descrit <- data.frame(treatMeans, SDs)
descrit

| | treatMeans | SDs |
|---|---|---|
| Metribuzin__348 | 9.1750 | 4.699089 |
| Mixture_378 | 5.1275 | 2.288557 |
| Rimsulfuron_30 | 16.8600 | 4.902353 |
| Unweeded | 26.7725 | 3.168674 |
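The same descriptive statistics can be produced with dplyr's summarise, as mentioned above; a sketch using the One_Way_Anova object:

```r
library(dplyr)

# Mean and standard deviation of Weight for each treatment level
One_Way_Anova %>%
  group_by(Treat) %>%
  summarise(treatMeans = mean(Weight),
            SDs = sd(Weight))
```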
Next we fit a linear model with Weight as the response variable and Treat as the categorical independent variable. You can either call the variable as a factor within the lm function, or convert the variable from a character to a factor beforehand. To check the variable types, use the str function or look at the data object in the Environment tab.
## spc_tbl_ [16 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Treat : chr [1:16] "Metribuzin__348" "Metribuzin__348" "Metribuzin__348" "Metribuzin__348" ...
## $ Weight: num [1:16] 15.2 4.38 10.32 6.8 6.14 ...
## - attr(*, "spec")=
## .. cols(
## .. Treat = col_character(),
## .. Weight = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
mod <- lm(Weight ~ factor(Treat), data = One_Way_Anova)
One_Way_Anova$Treat <- factor(One_Way_Anova$Treat)
str(One_Way_Anova)

## spc_tbl_ [16 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Treat : Factor w/ 4 levels "Metribuzin__348",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Weight: num [1:16] 15.2 4.38 10.32 6.8 6.14 ...
## - attr(*, "spec")=
## .. cols(
## .. Treat = col_character(),
## .. Weight = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
After the linear model is fit, you can view the parameter estimates using the summary function. The output shows the call (the terms included in the model), summary statistics of the residuals, a table of coefficients, and some other statistical results. This summary output is less useful for ANOVA, but we will revisit it for linear regression.
##
## Call:
## lm(formula = Weight ~ Treat, data = One_Way_Anova)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.360 -2.469 0.380 2.567 6.025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.175 1.959 4.684 0.000529 ***
## TreatMixture_378 -4.047 2.770 -1.461 0.169679
## TreatRimsulfuron_30 7.685 2.770 2.774 0.016832 *
## TreatUnweeded 17.598 2.770 6.352 3.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.918 on 12 degrees of freedom
## Multiple R-squared: 0.8554, Adjusted R-squared: 0.8193
## F-statistic: 23.66 on 3 and 12 DF, p-value: 2.509e-05
After fitting a model, the values you are most interested in reviewing are the fitted values and the residuals, which can be extracted using the fitted() and residuals() functions. The fitted values are the same as the treatment means calculated above. The residuals are the deviations of each observation from its treatment mean, and will be used in later steps.
expected <- fitted(mod) #Not too useful but included here to show how to extract these values
epsilon <- residuals(mod)

The ANOVA table is commonly obtained using one of two methods, from either the stats or the car package. Some of the more complicated models we fit later may work with one but not the other. We can offer some guidance, but other situations may require troubleshooting, so it is helpful to remember both options.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Treat | 3 | 1089.5287 | 363.17624 | 23.66259 | 2.51e-05 |
| Residuals | 12 | 184.1774 | 15.34812 | NA | NA |
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| Treat | 1089.5287 | 3 | 23.66259 | 2.51e-05 |
| Residuals | 184.1774 | 12 | NA | NA |
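The two tables above can be produced with anova() from the stats package and Anova() from the car package, respectively; a minimal sketch using the fitted mod object:

```r
library(car)

anova(mod)  # stats package: sequential (type I) sums of squares
Anova(mod)  # car package: type II by default (identical to type I for a single factor)
```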
This is the type of output that is more often reported for ANOVA models. The interpretation of the F statistic and the p value follows commonly adopted thresholds, where a p value < 0.05 is considered significant. The norm is to report the numerator and denominator degrees of freedom, the F value, and the p value (e.g., F(3,12) = 23.663, p < 0.0001). This says that the F statistic from this model, 23.663, is larger than the critical value from the F table with 3 numerator and 12 denominator degrees of freedom. This test only tells us that there is a significant difference among the levels of the categorical variable; it doesn’t say which levels are significantly different.
The results of the model are only valid when the assumptions for linear models are met. These are sometimes called the Gauss-Markov Assumptions or BLUE (Best Linear Unbiased Estimator). These assumptions will be the same for ANOVA, ANCOVA, and linear regression.
Some of the assumptions deal more with how the data are collected and the experiment is designed: variation in the independent variables, random sampling, and linearity in parameters. The other assumptions are assessed after the model is fit: normally distributed errors and homoskedasticity. There are other assumptions that are more theoretical in nature and won’t be discussed here.
The residuals of the linear model can be assessed either visually with a graph or with formal hypothesis tests. The graphical methods are powerful for detecting strong deviations from the basic assumptions, and the formal tests can be used when the answer is less clear.
To assess the normality of the errors (residuals) we often use a QQ (quantile-quantile) plot, in which the standardized residuals are plotted against the corresponding percentiles of a normal distribution. This can be achieved either with the qqnorm and qqline functions applied to the residuals we called epsilon earlier, or with the ggplot function using rstandard(mod). Make sure to run the qqnorm and qqline lines together.
The goal is for the points to fall as close to the line as possible. Some deviation from the line is normal, but look out for strong evidence of a skewed distribution, where the tails fall further from the line.
To assess homoskedasticity we look at the plot of the residuals against the fitted values. If the assumption of no systematic error and homogeneous errors is met, the residuals should be randomly scattered without any visible systematic pattern. We can look for potential outliers, a funnel-like pattern suggesting heteroskedasticity, or a positive/negative trend in the residuals over the fitted values.
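A minimal sketch of both diagnostic plots, using base graphics and the expected and epsilon objects created earlier:

```r
# QQ-plot of the residuals; run these two lines together
qqnorm(epsilon)
qqline(epsilon)

# Residuals vs fitted values; look for random scatter around zero
plot(expected, epsilon,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```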
Alternatively you can use the autoplot function from the ggfortify package to look at a comprehensive output and review the assumptions.
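A sketch of the autoplot call on the fitted model:

```r
library(ggfortify)

# Produces a panel of diagnostic plots for the lm object
autoplot(mod)
```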
Depending on your package versions, autoplot may print deprecation warnings from ggfortify (about `fortify()`, `aes_string()`, and the `size` aesthetic); these come from the package internals and can be safely ignored.
A formal test of the assumption of normally distributed residuals is the Shapiro-Wilk test applied to the residuals of the model. The null hypothesis is that the residuals are normally distributed, so a significant result (p value < 0.05) indicates the residuals are not normally distributed.
##
## Shapiro-Wilk normality test
##
## data: epsilon
## W = 0.97615, p-value = 0.9257
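The output above is produced by running the test on the residuals object:

```r
# Shapiro-Wilk test on the residuals extracted earlier
shapiro.test(epsilon)
```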
A formal test of the homogeneity-of-variance assumption is Levene’s test on the independent variable(s) in the model. The null hypothesis is that the variance is equal across all groups; a significant result (p value < 0.05) indicates at least one group has a different variance. This test is useful when you have a categorical independent variable, whereas the residuals vs fitted plot is more useful for a model with only continuous variables.
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 3 | 1.356301 | 0.3028878 |
| | 12 | NA | NA |
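The table above can be produced with leveneTest() from the car package; a sketch:

```r
library(car)

# Levene's test for equal variances across treatment groups
leveneTest(Weight ~ Treat, data = One_Way_Anova)
```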
The estimated marginal means (also known as least squares means) equal the arithmetic means when the experiment is balanced (same number of replicates within each group), but the two will differ when the experiment is unbalanced. We will use the emmeans() function to produce the estimated marginal means.
## Treat emmean SE df lower.CL upper.CL
## Metribuzin__348 9.18 1.96 12 4.91 13.4
## Mixture_378 5.13 1.96 12 0.86 9.4
## Rimsulfuron_30 16.86 1.96 12 12.59 21.1
## Unweeded 26.77 1.96 12 22.50 31.0
##
## Confidence level used: 0.95
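The output above comes from emmeans(); a sketch creating the mu_j object used in the later steps:

```r
library(emmeans)

# Estimated marginal means for each treatment level
mu_j <- emmeans(mod, specs = "Treat")
mu_j
```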
Given the evidence that the linear model meets the assumptions, further analysis can be completed to produce unbiased estimates. Rejecting the null hypothesis that all treatment groups are equal suggests there is at least one difference. The process of determining which treatment groups differ significantly from the others is known as post-hoc multiple comparisons testing.
Two general approaches for multiple comparisons tests include linear contrasts and pairwise comparisons.
Linear contrasts are linear combinations of means (model parameters) whose coefficients add up to zero. One example from the herbicide treatments would be to compare the three herbicide treatments to the control: 1/3 + 1/3 + 1/3 - 1 = 0. Each level of the treatment variable is given a constant that defines which levels are being compared; here the average of the three treatments is compared to the control. You can test multiple contrasts, but each set of constants has to sum to zero. When, in addition, the element-wise products of the coefficients of any two contrasts sum to zero, the contrasts are called orthogonal.
cont1 <- c(1/3, 1/3, 1/3, -1) #the average of the first three levels compared to the fourth
cont2 <- c(1/2, -1, 1/2, 0) #the average of the first and third level compared to the second
cont3 <- c(0, -1, 1, 0) #the second level compared to the third
cont4 <- c(1, -1, 0, 0) #the first level compared to the second
contrasts <- list(cont1, cont2, cont3, cont4)
contrast(mu_j, method = contrasts, adjust = "none") #mu_j holds the estimated marginal means

## contrast estimate SE df
## c(0.333333333333333, 0.333333333333333, 0.333333333333333, -1 -16.39 2.26 12
## c(0.5, -1, 0.5, 0) 7.89 2.40 12
## c(0, -1, 1, 0) 11.73 2.77 12
## c(1, -1, 0, 0) 4.05 2.77 12
## t.ratio p.value
## -7.244 <.0001
## 3.289 0.0065
## 4.235 0.0012
## 1.461 0.1697
The pairwise comparisons are more commonly used (sometimes overused) and follow two general approaches: 1) all pairwise comparisons, and 2) comparisons with a control (Dunnett’s method). All pairwise comparisons yield a large number of contrasts, some of which might not be of primary interest given the hypothesis; carefully consider these steps within the context of your data and hypotheses. Furthermore, the “dunnett” method defaults to comparing each level to the first level in alphabetical order, which may not be the control level. You can call the levels of the treatment variable from the dataset object and find that the fourth level is the control.
## contrast estimate SE df t.ratio p.value
## Metribuzin__348 - Mixture_378 4.05 2.77 12 1.461 0.1697
## Metribuzin__348 - Rimsulfuron_30 -7.68 2.77 12 -2.774 0.0168
## Metribuzin__348 - Unweeded -17.60 2.77 12 -6.352 <.0001
## Mixture_378 - Rimsulfuron_30 -11.73 2.77 12 -4.235 0.0012
## Mixture_378 - Unweeded -21.64 2.77 12 -7.813 <.0001
## Rimsulfuron_30 - Unweeded -9.91 2.77 12 -3.578 0.0038
## [1] "Metribuzin__348" "Mixture_378" "Rimsulfuron_30" "Unweeded"
## contrast estimate SE df t.ratio p.value
## Metribuzin__348 - Unweeded -17.60 2.77 12 -6.352 <.0001
## Mixture_378 - Unweeded -21.64 2.77 12 -7.813 <.0001
## Rimsulfuron_30 - Unweeded -9.91 2.77 12 -3.578 0.0038
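The two sets of comparisons above can be obtained from the emmeans object; a sketch, where ref = 4 is the emmeans option selecting the fourth level (the control) as the reference for the Dunnett-style contrasts:

```r
# All pairwise comparisons, unadjusted (Fisher's LSD)
contrast(mu_j, method = "pairwise", adjust = "none")

# Comparisons with the control; ref = 4 selects the fourth level (Unweeded)
contrast(mu_j, method = "dunnett", ref = 4, adjust = "none")
```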
With a large number of contrasts it can be impractical to report the effect size of every comparison. The cld (compact letter display) function helps to visualize the differences between treatment levels. You can produce either lowercase letters (letters) or uppercase letters (LETTERS), and have the letters follow increasing or decreasing emmean values. Be mindful that the letter display says nothing about the direction or size of the differences, which is its chief criticism as a way to present experimental results. This criticism can be alleviated when the letters are shown alongside the raw data.
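A sketch of the unadjusted compact letter display; cld() for emmeans objects requires the multcomp/multcompView packages, and the %>% pipe is assumed to be available from the tidyverse:

```r
library(emmeans)
library(multcomp)  # provides cld() for emmeans objects

# Unadjusted (Fisher's LSD) letter display, ascending emmeans
mod %>%
  emmeans(specs = "Treat") %>%
  cld(adjust = "none", Letters = letters)
```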
| | Treat | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 2 | Mixture_378 | 5.1275 | 1.958834 | 12 | 0.8595676 | 9.395432 | a |
| 1 | Metribuzin__348 | 9.1750 | 1.958834 | 12 | 4.9070676 | 13.442932 | a |
| 3 | Rimsulfuron_30 | 16.8600 | 1.958834 | 12 | 12.5920676 | 21.127932 | b |
| 4 | Unweeded | 26.7725 | 1.958834 | 12 | 22.5045676 | 31.040432 | c |
mu_j_reversed <- mod %>%
emmeans(specs = "Treat") %>%
cld(adjust = "none", Letters = letters, reversed = TRUE)
mu_j_reversed

| | Treat | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 2 | Unweeded | 26.7725 | 1.958834 | 12 | 22.5045676 | 31.040432 | a |
| 1 | Rimsulfuron_30 | 16.8600 | 1.958834 | 12 | 12.5920676 | 21.127932 | b |
| 3 | Metribuzin__348 | 9.1750 | 1.958834 | 12 | 4.9070676 | 13.442932 | c |
| 4 | Mixture_378 | 5.1275 | 1.958834 | 12 | 0.8595676 | 9.395432 | c |
In experiments with a high number of contrasts or simultaneously tested hypotheses, there is the potential for an increased type I error. This is known as the multiplicity problem. The adjust = “none” argument above produces the results of the multiple comparisons without any adjustment to the p values and is known as Fisher’s Least Significant Difference (LSD). This is the least conservative approach to multiple comparisons, since there is no attempt to correct for the inflated type I error. It has been argued to be appropriate when the p value of the F-test is highly significant, but this is up for debate.
Other approaches make adjustments to the p values and the confidence intervals of the pairwise comparisons. The Sidak method is one approach that can be used with the following argument - adjust=“sidak”.
| | Treat | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 2 | Mixture_378 | 5.1275 | 1.958834 | 12 | -0.6004451 | 10.85545 | a |
| 1 | Metribuzin__348 | 9.1750 | 1.958834 | 12 | 3.4470549 | 14.90295 | ab |
| 3 | Rimsulfuron_30 | 16.8600 | 1.958834 | 12 | 11.1320549 | 22.58795 | b |
| 4 | Unweeded | 26.7725 | 1.958834 | 12 | 21.0445549 | 32.50045 | c |
Another example, the Bonferroni correction, is a simpler adjustment procedure and is more commonly used.
| | Treat | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 2 | Mixture_378 | 5.1275 | 1.958834 | 12 | -0.6206176 | 10.87562 | a |
| 1 | Metribuzin__348 | 9.1750 | 1.958834 | 12 | 3.4268824 | 14.92312 | ab |
| 3 | Rimsulfuron_30 | 16.8600 | 1.958834 | 12 | 11.1118824 | 22.60812 | b |
| 4 | Unweeded | 26.7725 | 1.958834 | 12 | 21.0243824 | 32.52062 | c |
The last multiplicity adjustment we will discuss is the option I most commonly see: Tukey’s HSD. It is the default method; we simply remove the adjust argument. All three are considered conservative adjustments. You can also see the individual comparisons by including the details = TRUE option.
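A sketch of the Tukey-adjusted display; omitting the adjust argument gives Tukey’s HSD by default, and swapping in adjust = "sidak" or adjust = "bonferroni" reproduces the earlier tables:

```r
# Tukey HSD is the default adjustment when adjust is not specified
mod %>%
  emmeans(specs = "Treat") %>%
  cld(Letters = letters)

# Add details = TRUE to also print the individual pairwise comparisons
```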
| | Treat | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 2 | Mixture_378 | 5.1275 | 1.958834 | 12 | 0.8595676 | 9.395432 | a |
| 1 | Metribuzin__348 | 9.1750 | 1.958834 | 12 | 4.9070676 | 13.442932 | ab |
| 3 | Rimsulfuron_30 | 16.8600 | 1.958834 | 12 | 12.5920676 | 21.127932 | b |
| 4 | Unweeded | 26.7725 | 1.958834 | 12 | 22.5045676 | 31.040432 | c |
We can use the results of the multiple comparisons test to help answer our hypotheses.
mu_j <- mu_j %>%
as_tibble()
Plot_means_tukey <- ggplot() +
geom_point(data = One_Way_Anova, aes(y = Weight, x = Treat), position = position_jitter(width = 0.1)) + #dots representing the raw data
geom_boxplot(data = One_Way_Anova, aes(y = Weight, x = Treat), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + #boxplot
geom_point(data = mu_j, aes(y = emmean, x = Treat), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = mu_j, aes(ymin = lower.CL, ymax = upper.CL, x = Treat), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data=mu_j,aes(y=emmean,x=Treat,label=str_trim(.group)),position=position_nudge(x=0.25),color="black",angle=0)+ # letters indicating the significance groups
labs(y="Weed Weight (g)",x="Treatment")+
theme_classic()
Plot_means_tukey

All treatments are better than the control (i.e., unweeded). Mixture_378 is a better treatment than Rimsulfuron_30, and Metribuzin__348 is not significantly different from either of them.
When you have multiple independent variables in the linear model that are all categorical this is known as a two-way ANOVA. The following steps demonstrate how to perform an analysis of a two-way ANOVA.
By reviewing the data and its characteristics you can confirm the appropriate analysis and understand the context of the hypothesis. This will also be helpful when considering how to visualize the data to help interpret the results.
These data show the body mass in grams among different penguin species on different islands and between sexes of penguins. There are other variables recorded that could be analyzed, but we will be focusing on the species, sex, and body mass (g) of the penguins.
Remember the “tidy” format…
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The full dataset contains eight variables for 333 penguins (after removing rows with NAs). Make sure to check the class of each variable in the dataframe to confirm the categorical variables are treated as factors. You can either look in the Environment tab or use the str() function.
## species island bill_length_mm bill_depth_mm
## Length:333 Length:333 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:39.50 1st Qu.:15.60
## Mode :character Mode :character Median :44.50 Median :17.30
## Mean :43.99 Mean :17.16
## 3rd Qu.:48.60 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## flipper_length_mm body_mass_g sex year
## Min. :172 Min. :2700 Length:333 Min. :2007
## 1st Qu.:190 1st Qu.:3550 Class :character 1st Qu.:2007
## Median :197 Median :4050 Mode :character Median :2008
## Mean :201 Mean :4207 Mean :2008
## 3rd Qu.:213 3rd Qu.:4775 3rd Qu.:2009
## Max. :231 Max. :6300 Max. :2009
## tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : chr [1:333] "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr [1:333] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: num [1:333] 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : num [1:333] 3750 3800 3250 3450 3650 ...
## $ sex : chr [1:333] "male" "female" "female" "female" ...
## $ year : num [1:333] 2007 2007 2007 2007 2007 ...
Two_Way_Anova$species <- factor(Two_Way_Anova$species)
Two_Way_Anova$sex <- factor(Two_Way_Anova$sex)
str(Two_Way_Anova)

## tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : chr [1:333] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: num [1:333] 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : num [1:333] 3750 3800 3250 3450 3650 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
## $ year : num [1:333] 2007 2007 2007 2007 2007 ...
Next we fit a linear model with body_mass_g as the response variable and the factors species and sex as the independent categorical variables.
We are interested in determining: 1) the variation in body mass among the different species of penguins; 2) the variation in body mass between the sexes of penguins; and 3) whether the variation in body mass among species differs between females and males (the interaction).
The hypothesis tests are as follows.

Main effect of sex on body mass:
H0: mean body mass is equal between females and males.
H1: mean body mass is different between females and males.

Main effect of species on body mass:
H0: mean body mass is equal across all three species.
H1: mean body mass is different for at least one species.

Interaction between sex and species:
H0: there is no interaction between sex and species, meaning that the relationship between species and body mass is the same for females and males.
H1: there is an interaction between sex and species, meaning that the relationship between species and body mass is different for females than for males.
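The model, including the interaction, is fit as shown in the Call of the summary output that follows:

```r
# The * operator includes both main effects and their interaction
mod <- lm(body_mass_g ~ species * sex, data = Two_Way_Anova)
summary(mod)
```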
After the linear model is fit, we call the parameter estimates using the summary function.
##
## Call:
## lm(formula = body_mass_g ~ species * sex, data = Two_Way_Anova)
##
## Residuals:
## Min 1Q Median 3Q Max
## -827.21 -213.97 11.03 206.51 861.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3368.84 36.21 93.030 < 2e-16 ***
## speciesChinstrap 158.37 64.24 2.465 0.01420 *
## speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
## sexmale 674.66 51.21 13.174 < 2e-16 ***
## speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
## speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 309.4 on 327 degrees of freedom
## Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
## F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
And we extract the residual values.
In this instance we are testing both species and sex and their interaction, so the type III sums of squares are most appropriate. Type I sums of squares test each factor sequentially, and type II sums of squares are inappropriate when there is an interaction term.
This link can provide more details for your reference.
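The three tables below correspond to type I, type II, and type III sums of squares; a sketch of the calls (note that for meaningful type III tests the factors should use sum-to-zero contrasts, a caveat worth checking in your own analysis):

```r
library(car)

anova(mod)            # type I (sequential) sums of squares
Anova(mod, type = 2)  # type II sums of squares
Anova(mod, type = 3)  # type III sums of squares
```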
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| species | 2 | 145190219 | 72595109.56 | 758.358072 | 0.0000000 |
| sex | 1 | 37090262 | 37090261.78 | 387.459976 | 0.0000000 |
| species:sex | 2 | 1676557 | 838278.37 | 8.756997 | 0.0001973 |
| Residuals | 327 | 31302628 | 95726.69 | NA | NA |
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| species | 143401584 | 2 | 749.015666 | 0.0000000 |
| sex | 37090262 | 1 | 387.459976 | 0.0000000 |
| species:sex | 1676557 | 2 | 8.756997 | 0.0001973 |
| Residuals | 31302628 | 327 | NA | NA |
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| (Intercept) | 828480899 | 1 | 8654.648789 | 0.0000000 |
| species | 60350016 | 2 | 315.220419 | 0.0000000 |
| sex | 16613442 | 1 | 173.550777 | 0.0000000 |
| species:sex | 1676557 | 2 | 8.756997 | 0.0001973 |
| Residuals | 31302628 | 327 | NA | NA |
Remember that this test only tells us that there is a significant difference among the levels of the categorical variable. It doesn’t say which levels are significantly different.
To assess the normality of errors (residuals) we use a QQ (quantile-quantile)-plot. The standardized residuals are plotted against the respective percentiles of a normal distribution.
To assess homoskedasticity we look at the plot of the residuals against the fitted values.
Then the Shapiro-Wilk test on the residuals of the model.
##
## Shapiro-Wilk normality test
##
## data: epsilon
## W = 0.99776, p-value = 0.9367
Finally the Levene’s test on the independent variables in the model.
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 5 | 1.390826 | 0.2272166 |
| | 327 | NA | NA |
Results of the above figures and tests indicate that model assumptions have been met.
We can call for the emmeans for both main effects and the interaction.
mu_species <- mod %>%
emmeans(~ species) %>%
as_tibble() %>% #Converts the object to a tibble
arrange(emmean) #Arranges the emmean values in ascending order

## NOTE: Results may be misleading due to involvement in interactions
| species | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| Adelie | 3706.164 | 25.60590 | 327 | 3655.791 | 3756.537 |
| Chinstrap | 3733.088 | 37.51993 | 327 | 3659.277 | 3806.899 |
| Gentoo | 5082.289 | 28.37142 | 327 | 5026.475 | 5138.102 |
## NOTE: Results may be misleading due to involvement in interactions
| sex | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| female | 3858.594 | 25.33613 | 327 | 3808.752 | 3908.437 |
| male | 4489.100 | 25.15752 | 327 | 4439.609 | 4538.591 |
group_by(Two_Way_Anova, species, sex) %>%
summarise(mean = round(mean(body_mass_g, na.rm = TRUE)),
sd = round(sd(body_mass_g, na.rm = TRUE)))

## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
| species | sex | mean | sd |
|---|---|---|---|
| Adelie | female | 3369 | 269 |
| Adelie | male | 4043 | 347 |
| Chinstrap | female | 3527 | 285 |
| Chinstrap | male | 3939 | 362 |
| Gentoo | female | 4680 | 282 |
| Gentoo | male | 5485 | 313 |
Given the evidence that the linear model meets the assumptions, further analysis can be completed to produce unbiased estimates. Rejecting the null hypothesis that body mass is the same across species and sexes suggests there is a difference among the levels of one variable across the other variable. We move on to the post-hoc multiple comparisons tests.
When there is a significant interaction effect you shouldn’t interpret the main effects independently, since each depends on the level of the other factor. There are additional options for multiple comparisons when you have multiple factors and interactions: you can test all pairwise combinations of the two factors by placing a colon between the two terms, or test pairwise comparisons of one factor within each level of the other by placing a pipe between the two terms. We will show both methods as an example.
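The first table below comes from the colon form; a sketch creating the all_means_tukey object used later:

```r
# All pairwise comparisons across every species-by-sex combination
all_means_tukey <- mod %>%
  emmeans(~ species:sex) %>%
  cld(Letters = letters)
all_means_tukey
```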
| | species | sex | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|---|
| 1 | Adelie | female | 3368.836 | 36.21222 | 327 | 3297.597 | 3440.074 | a |
| 2 | Chinstrap | female | 3527.206 | 53.06120 | 327 | 3422.821 | 3631.590 | a |
| 5 | Chinstrap | male | 3938.971 | 53.06120 | 327 | 3834.586 | 4043.355 | b |
| 4 | Adelie | male | 4043.493 | 36.21222 | 327 | 3972.255 | 4114.731 | b |
| 3 | Gentoo | female | 4679.741 | 40.62586 | 327 | 4599.820 | 4759.662 | c |
| 6 | Gentoo | male | 5484.836 | 39.61427 | 327 | 5406.905 | 5562.767 | d |
within_means_tukey <- mod %>%
emmeans(~ species | sex) %>%
cld(Letters = letters)
within_means_tukey

| | species | sex | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|---|
| 1 | Adelie | female | 3368.836 | 36.21222 | 327 | 3297.597 | 3440.074 | a |
| 2 | Chinstrap | female | 3527.206 | 53.06120 | 327 | 3422.821 | 3631.590 | b |
| 3 | Gentoo | female | 4679.741 | 40.62586 | 327 | 4599.820 | 4759.662 | c |
| 5 | Chinstrap | male | 3938.971 | 53.06120 | 327 | 3834.586 | 4043.355 | a |
| 4 | Adelie | male | 4043.493 | 36.21222 | 327 | 3972.255 | 4114.731 | a |
| 6 | Gentoo | male | 5484.836 | 39.61427 | 327 | 5406.905 | 5562.767 | b |
all_means_tukey <- all_means_tukey %>%
as_tibble() #convert object to tibble so that ggplot can read the details
Plot_all_means_tukey <- ggplot() +
geom_point(data = Two_Way_Anova, aes(y = body_mass_g, x = species, color = sex), position = position_dodgenudge(width = 0.9, x = 0)) + #dots representing the raw data
geom_boxplot(data = Two_Way_Anova, aes(y = body_mass_g, x = species, color = sex), width = 0.25, outlier.shape = NA, position = position_dodgenudge(width = 0.5, x = 0)) + # boxplot
geom_point(data = all_means_tukey, aes(y = emmean, x = species, group = sex), position = position_dodgenudge(width = 0.5, x = 0), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = all_means_tukey, aes(ymin = lower.CL, ymax = upper.CL, x = species, group = sex), position = position_dodgenudge(width = 0.5, x = 0),color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = all_means_tukey, aes(y = emmean, x = species, group = sex, label = str_trim(.group)), color = "black", position = position_dodgenudge(width = 1.2, x = -0.025), hjust = 0, angle = 0) + # compact letter display labels
labs(y = "Body Mass (g)", x = "Species") +
theme_classic()
Plot_all_means_tukey
## Warning: `position_dodge()` requires non-overlapping x intervals.
within_means_tukey <- within_means_tukey %>%
as_tibble()
Plot_within_means_tukey <- ggplot() +
facet_wrap(~ sex, labeller = label_both) + #facet per sex level
geom_point(data = Two_Way_Anova, aes(y = body_mass_g, x = species, color = species)) + # dots representing the raw data
geom_boxplot(data = Two_Way_Anova, aes(y = body_mass_g, x = species, color = species), width = 0.25, outlier.shape = NA, position = position_nudge(x = 0.2)) + # boxplot
geom_point(data = within_means_tukey, aes(y = emmean, x = species), color = "red", position = position_nudge(x = 0.2)) + # red dots representing the adjusted means
geom_errorbar(data = within_means_tukey, aes(ymin = lower.CL, ymax = upper.CL, x = species), color = "red", width = 0.1, position = position_nudge(x = 0.2)) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = within_means_tukey, aes(y = emmean, x = species, label = str_trim(.group)), color = "black", position = position_nudge(x = 0.4), hjust = 0) + # compact letter display labels
labs(y = "Body Mass (g)", x = "Species") +
theme_classic() # clearer plot format
Plot_within_means_tukey
When all of the independent variables in a linear model are continuous, the model is known as linear regression. Simple linear regression has one independent variable, and multiple linear regression has two or more independent variables.
Data from (Kniss et al. 2012) shows the sugar yield of sugar beet in response to volunteer corn density. The response variable is sucrose production in pounds per acre (LbsSucA), and the independent variable is volunteer corn density in plants per foot of row.
## Rows: 24 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Loc
## dbl (2): Density, LbsSucA
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
After the linear model has been fit, you can view the parameter estimates using the summary() function.
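The fitting code itself is not shown; based on the Call line in the output below, it was equivalent to the following sketch. The CSV is not bundled here, so this standalone version simulates a data frame with the same column names; with the real data, only the read-in step differs.

```r
set.seed(42)
# Hypothetical stand-in for the real data read from CSV (same column names)
Simple_Linear_Regression <- data.frame(Density = runif(24, 0, 0.15))
Simple_Linear_Regression$LbsSucA <- 8678 -
  20645 * Simple_Linear_Regression$Density + rnorm(24, sd = 1383)

mod <- lm(LbsSucA ~ Density, data = Simple_Linear_Regression)
summary(mod)  # coefficient estimates; values differ from the real data
```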
##
## Call:
## lm(formula = LbsSucA ~ Density, data = Simple_Linear_Regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2086.02 -1084.01 23.92 726.23 3007.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8677.6 485.6 17.869 1.38e-14 ***
## Density -20644.7 5645.2 -3.657 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1383 on 22 degrees of freedom
## Multiple R-squared: 0.3781, Adjusted R-squared: 0.3498
## F-statistic: 13.37 on 1 and 22 DF, p-value: 0.001387
The resulting output provides estimates for the intercept (8677.6) and slope (-20644.7) of the regression line. The regression equation for this example is Y (lbs sucrose/acre) = 8677.6 - 20644.7 * X (density). These results say that every one-unit increase in volunteer corn density (plants/ft row) produces a 20,644.7-unit decrease in yield. The maximum volunteer corn density used in the study was only 0.15 plants per foot of sugar beet row, so a one-unit change in density is beyond the scale of the data used in the study. We can still use the equation to calculate the estimated pounds of sucrose per acre for a fractional change in density.
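For example, the fitted equation can be used to estimate sucrose yield at a density within the study's range (a small sketch; the density value here is hypothetical):

```r
intercept <- 8677.6    # estimated intercept from the summary above
slope     <- -20644.7  # estimated slope
density   <- 0.10      # plants/ft row, within the 0-0.15 range studied
yield_hat <- intercept + slope * density
yield_hat              # predicted lbs sucrose/acre: 6613.13
```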
The ANOVA table can be used but provides little information here. Furthermore, notice that the results for the effect of density do not change among the tables below - with a single effect and no interaction, the different sums-of-squares types are equivalent.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Density | 1 | 25572276 | 25572276 | 13.37385 | 0.0013869 |
| Residuals | 22 | 42066424 | 1912110 | NA | NA |
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| Density | 25572276 | 1 | 13.37385 | 0.0013869 |
| Residuals | 42066424 | 22 | NA | NA |
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| (Intercept) | 610547289 | 1 | 319.30549 | 0.0000000 |
| Density | 25572276 | 1 | 13.37385 | 0.0013869 |
| Residuals | 42066424 | 22 | NA | NA |
To assess the normality of errors (residuals) we use the QQ (quantile-quantile)-plot again.
To assess homoskedasticity we look at the plot of the residuals against the fitted values.
Then run the Shapiro-Wilk test on the residuals of the model.
##
## Shapiro-Wilk normality test
##
## data: mod$residuals
## W = 0.96343, p-value = 0.511
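These three checks can be sketched as a standalone chunk; the data here are simulated stand-ins, since the original model object comes from earlier chunks:

```r
set.seed(1)
d <- data.frame(x = runif(24))
d$y <- 2 + 3 * d$x + rnorm(24)          # simulated stand-in data
mod_demo <- lm(y ~ x, data = d)

qqnorm(resid(mod_demo)); qqline(resid(mod_demo))  # normality of residuals
plot(fitted(mod_demo), resid(mod_demo))           # homoskedasticity check
shapiro.test(resid(mod_demo))                     # formal normality test
```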
There is no need for a Levene test, since no categorical variable is included in the model.
This is a simple plot for the simple linear regression model above. There will be more complicated figures and syntax required with more complicated models.
par(mfrow = c(1, 1), mar = c(3.2, 3.2, 2, 0.5), mgp = c(2, 0.7, 0)) #Number of rows of figures, the margins of the figure, and margin line for the axis title labels and line
plot(Simple_Linear_Regression$LbsSucA ~ Simple_Linear_Regression$Density, bty = "l", #Y and X variables for the plot and the "L" box type for the plot area
ylab = "Sugar yield (lbs/A)", xlab = "Volunteer corn density (plants/ft row)",
main = "Lingle, WY 2009", ylim = c(0, 10000))
abline(mod) # Add the regression line
# to add the regression equation into the plot:
int <- round(summary(mod)$coefficients[[1]]) # get intercept
sl <- round(summary(mod)$coefficients[[2]]) # get slope
reg.eq <- paste("Y =", int, sl, "* X") # create text regression equation
legend("bottomleft", reg.eq, bty = "n")
The results from this model focus on the estimate of the slope (density), which is negative. This indicates that as density increases, yield decreases.
When a linear model includes multiple independent variables, at least one categorical and at least one continuous, it is known as ANCOVA (analysis of covariance). The following steps demonstrate how to perform an ANCOVA.
By reviewing the data and its characteristics you can confirm the appropriate analysis and understand the context of the hypothesis. This will also be helpful when considering how to visualize the data to help interpret the results.
Remember the “tidy” format…
ANCOVA <- read_csv(here("data", "ANCOVA.csv")) %>%
mutate(loc = as.factor(loc)) # Convert loc to a factor
## Rows: 84 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): loc
## dbl (2): density, yield
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset contains the results from a onion trial that tested the effect of planting density (plants per square meter) at two locations (Purnong Landing or Virginia).
## density yield loc
## Min. : 18.78 Min. : 28.96 P:42
## 1st Qu.: 39.54 1st Qu.: 78.46 V:42
## Median : 61.78 Median :108.90
## Mean : 73.33 Mean :119.70
## 3rd Qu.: 97.86 3rd Qu.:153.47
## Max. :184.75 Max. :272.15
Fit a linear model with yield as the response variable and density (continuous) and location (categorical) as the independent variables. Make sure to check the type of each variable in the dataframe to confirm that the categorical variable is stored as a factor. You can either look in the Environment tab or use the str() function.
We are interested in determining the following:
1) The variation in yield among the different locations
2) The variation in yield along the gradient of density
3) Whether the variation in yield along the gradient of density differs between the two locations (interaction)
The hypothesis tests are as follows:
Main effect of location on yield:
H0: mean yield is equal between locations
H1: mean yield is different between locations
Main effect of density on yield:
H0: slope of trend line between density and yield is zero
H1: slope of trend line between density and yield is not zero
Interaction between location and density:
H0: there is no interaction between location and density, meaning that
the relationship between density and yield is the same for both
locations
H1: there is an interaction between location and density, meaning that
the relationship between density and yield is different between
locations
## tibble [84 × 3] (S3: tbl_df/tbl/data.frame)
## $ density: num [1:84] 23.5 26.2 27.8 32.9 33.3 ...
## $ yield : num [1:84] 223 234 222 222 197 ...
## $ loc : Factor w/ 2 levels "P","V": 1 1 1 1 1 1 1 1 1 1 ...
After the linear model has been fit, you can view the parameter estimates using the summary() function.
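As with the simple regression, the fitting code is implied by the Call line in the output below. A standalone sketch (the onion CSV is not bundled here, so this version simulates hypothetical data with the same column names; with the real data only the read-in differs):

```r
set.seed(7)
# Hypothetical stand-in for the onion data (same column names)
ANCOVA_demo <- data.frame(
  density = runif(84, 18, 185),
  loc     = factor(rep(c("P", "V"), each = 42))
)
ANCOVA_demo$yield <- 220 - 1.1 * ANCOVA_demo$density -
  38 * (ANCOVA_demo$loc == "V") + rnorm(84, sd = 26)

# density * loc expands to density + loc + density:loc (interaction model)
mod <- lm(yield ~ density * loc, data = ANCOVA_demo)
summary(mod)
```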
##
## Call:
## lm(formula = yield ~ density * loc, data = ANCOVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.054 -16.329 -7.433 14.720 110.485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 219.65959 8.61694 25.492 < 2e-16 ***
## density -1.13684 0.09641 -11.792 < 2e-16 ***
## locV -37.97622 11.67184 -3.254 0.00167 **
## density:locV 0.07090 0.13904 0.510 0.61148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.03 on 80 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7593
## F-statistic: 88.27 on 3 and 80 DF, p-value: < 2.2e-16
After fitting a model we extract the residuals.
The ANOVA table needs the Type III test because the model includes an interaction effect.
| Sum Sq | Df | F value | Pr(>F) | |
|---|---|---|---|---|
| (Intercept) | 440242.8033 | 1 | 649.8222443 | 0.0000000 |
| density | 94198.7191 | 1 | 139.0424161 | 0.0000000 |
| loc | 7172.0378 | 1 | 10.5863166 | 0.0016701 |
| density:loc | 176.1878 | 1 | 0.2600627 | 0.6114811 |
| Residuals | 54198.5513 | 80 | NA | NA |
The interaction effect isn’t significant, but each main effect is. Remember that this test only tells us that there is a significant difference among the levels of the categorical variable. It doesn’t say which levels are significantly different.
Assess the normality of errors (residuals) using a QQ (quantile-quantile)-plot.
Some deviation from the line is normal, but look out for strong evidence of a skewed distribution where the tails are further from the line.
Then assess homoskedasticity with the plot of the residuals against the fitted values.
The formal hypothesis test of the assumption of normally distributed residuals is the Shapiro-Wilk test on the residuals of the model. The null hypothesis is that the residuals are normally distributed; a significant result (p < 0.05) indicates the residuals are not normally distributed.
##
## Shapiro-Wilk normality test
##
## data: mod$residuals
## W = 0.90293, p-value = 1.045e-05
When results suggest that one of the linear model assumptions is not met, different stabilizing transformations are available to remove potential biases and produce better estimates from the linear model. Logarithmic transformations are often used for counts and proportions. The square-root transformation is also used for counts. The arcsine-square-root transformation is very common with proportions.
These transformations were necessary prior to the advancements in computing power that allow for generalized linear models (glm) and generalized least squares (gls) models, which handle non-normal and heteroscedastic residuals. Other options we will discuss include nonparametric methods that make few or no assumptions about the distribution of residuals.
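A quick standalone illustration of these transformations in R (the values are hypothetical):

```r
counts <- c(4, 9, 25)
log(counts + 1)        # log transform; the +1 guards against zero counts
sqrt(counts)           # square-root transform for counts

props <- c(0.10, 0.50, 0.90)
asin(sqrt(props))      # arcsine-square-root transform for proportions
```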
Given the results of the residuals plot and the QQ-plot, we will try different transformations to meet the model assumptions.
##
## Call:
## lm(formula = log(yield) ~ density * loc, data = ANCOVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28094 -0.08451 -0.01393 0.08597 0.46119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5469442 0.0453984 122.184 < 2e-16 ***
## density -0.0096874 0.0005079 -19.072 < 2e-16 ***
## locV -0.1868138 0.0614932 -3.038 0.00322 **
## density:locV -0.0017592 0.0007325 -2.402 0.01864 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1371 on 80 degrees of freedom
## Multiple R-squared: 0.9163, Adjusted R-squared: 0.9132
## F-statistic: 292.1 on 3 and 80 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = sqrt(yield) ~ density * loc, data = ANCOVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6303 -0.5901 -0.1920 0.6584 3.6758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.248086 0.314155 48.537 < 2e-16 ***
## density -0.051729 0.003515 -14.717 < 2e-16 ***
## locV -1.415515 0.425530 -3.326 0.00133 **
## density:locV -0.002126 0.005069 -0.419 0.67610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9489 on 80 degrees of freedom
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.8451
## F-statistic: 151.9 on 3 and 80 DF, p-value: < 2.2e-16
Neither transformation improves the distribution of the residuals, so we will try fitting a model that includes a polynomial term to address the curvature apparent in the residuals.
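The poly() call used below adds orthogonal linear and quadratic terms to the design matrix; a small standalone illustration of what the model gains:

```r
x <- 1:10
mm <- model.matrix(~ poly(x, 2))  # intercept, linear, and quadratic columns
head(mm)
ncol(mm)  # 3 columns
```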
##
## Call:
## lm(formula = yield ~ poly(density, 2) * loc, data = ANCOVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.719 -9.717 0.843 8.556 80.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 138.096 2.842 48.598 < 2e-16 ***
## poly(density, 2)1 -452.487 25.680 -17.620 < 2e-16 ***
## poly(density, 2)2 164.686 25.767 6.391 1.11e-08 ***
## locV -35.824 4.014 -8.924 1.52e-13 ***
## poly(density, 2)1:locV 70.837 36.994 1.915 0.0592 .
## poly(density, 2)2:locV 12.187 36.737 0.332 0.7410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.15 on 78 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.8829
## F-statistic: 126.2 on 5 and 78 DF, p-value: < 2.2e-16
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| (Intercept) | 778268.66 | 1 | 2361.813187 | 0.0000000 |
| poly(density, 2) | 107659.44 | 2 | 163.357143 | 0.0000000 |
| loc | 26242.18 | 1 | 79.637187 | 0.0000000 |
| poly(density, 2):loc | 1246.35 | 2 | 1.891151 | 0.1577523 |
| Residuals | 25702.69 | 78 | NA | NA |
We find the same result: the interaction effect is not significant, and we can now move on to the multiple comparisons steps.
We will use the emmeans() function to produce the estimated marginal means.
mu_location <- mod_2 %>%
emmeans(~ loc) %>% # same as , at = list(density = mean(ANCOVA$density)) - conditional effect
cld(Letters = letters) %>%
as_tibble() %>% # Converts the object to a tibble
arrange(emmean) # Arranges the emmean values in ascending order
## NOTE: Results may be misleading due to involvement in interactions
| loc | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|
| V | 84.49209 | 3.992269 | 78 | 76.54409 | 92.44009 | a |
| P | 121.54122 | 3.649864 | 78 | 114.27490 | 128.80754 | b |
We are comparing the slopes of the two trend lines to see if they are significantly different. The results of the ANOVA table indicate that they are not significantly different. We can review the multiple comparisons results to demonstrate how you would show significant results in the event that future examples show a significant interaction effect.
#emtrends to compare slopes between categorical variables
mu_density_linear <- mod_2 %>%
emtrends(pairwise ~ loc, var = "density", at = list(density = mean(ANCOVA$density)), deriv = 1) %>%
cld(Letters = letters)
mu_density_linear
| loc | density.trend | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|
| P | -1.557676 | 0.0941092 | 78 | -1.745033 | -1.370319 | a |
| V | -1.397229 | 0.0853642 | 78 | -1.567177 | -1.227283 | a |
mu_density_quadratic <- mod_2 %>%
emtrends(pairwise ~ loc, var = "density", deriv = 2) %>%
cld(Letters = letters)
mu_density_quadratic
| loc | density.trend | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|
| P | -1.557676 | 0.0941092 | 78 | -1.745033 | -1.370319 | a |
| V | -1.397229 | 0.0853642 | 78 | -1.567177 | -1.227283 | a |
We need to produce a trend line for each location. We will use the expand.grid() and predict() functions to produce out-of-sample predictions that form the trend line. If the model included other terms, expand.grid() would need to include them as well (not relevant in this example); you would set those other variables to their mean or median, since you are testing the relationship between density and yield at both locations while holding everything else constant.
## density yield loc
## Min. : 18.78 Min. : 28.96 P:42
## 1st Qu.: 39.54 1st Qu.: 78.46 V:42
## Median : 61.78 Median :108.90
## Mean : 73.33 Mean :119.70
## 3rd Qu.: 97.86 3rd Qu.:153.47
## Max. :184.75 Max. :272.15
new.x <- expand.grid(density = seq(18.78, 184.75, length.out=100),
loc = levels(ANCOVA$loc))
head(new.x)
| density | loc |
|---|---|
| 18.78000 | P |
| 20.45646 | P |
| 22.13293 | P |
| 23.80939 | P |
| 25.48586 | P |
| 27.16232 | P |
new.y <- predict(mod_2, newdata = new.x, interval = 'confidence')
#Bring new.x and new.y together
addThese <- data.frame(new.x, new.y)
addThese <- rename(addThese, yield = fit)
head(addThese)
| density | loc | yield | lwr | upr |
|---|---|---|---|---|
| 18.78000 | P | 235.5113 | 220.3284 | 250.6943 |
| 20.45646 | P | 231.1478 | 216.6635 | 245.6321 |
| 22.13293 | P | 226.8389 | 213.0302 | 240.6477 |
| 23.80939 | P | 222.5847 | 209.4276 | 235.7417 |
| 25.48586 | P | 218.3850 | 205.8551 | 230.9148 |
| 27.16232 | P | 214.2399 | 202.3118 | 226.1680 |
ggplot(ANCOVA, aes(x = density, y = yield, color = loc)) +
geom_point(size = 5) +
geom_smooth(data = addThese, aes(ymin = lwr, ymax = upr, fill = loc), stat = 'identity') +
scale_color_manual(values = c(P = "blue", V = "red")) +
scale_fill_manual(values = c(P = "blue", V = "red")) +
theme_classic()
We see parallel trend lines, suggesting no significant difference in the relationship between density and yield between the two locations.
We are going to revisit the penguin data for practice. We can read the data in again.
Two_Way_Anova<-read_csv(here("data","Two_Way_ANOVA.csv")) %>%
mutate(island = as.factor(island),
year = as.factor(year),
species = as.factor(species))
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Two_Way_Anova <- subset(Two_Way_Anova, !is.na(sex)) #removes NAs within the sex variable
str(Two_Way_Anova)
## tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: num [1:333] 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : num [1:333] 3750 3800 3250 3450 3650 ...
## $ sex : chr [1:333] "male" "female" "female" "female" ...
## $ year : Factor w/ 3 levels "2007","2008",..: 1 1 1 1 1 1 1 1 1 1 ...
Example 1 Following the guidelines above test the following hypotheses:
Think about which type of model you need to use to test these hypotheses. Report the results from your assessments of the linear model assumptions. Use the multiple comparisons results in a final figure to demonstrate how the results support or refute the above hypotheses.
Example 2 Following the guidelines above test the following hypotheses:
Think about which type of model you need to use to test these hypotheses. Report the results from your assessments of the linear model assumptions. Use the multiple comparisons results in a final figure to demonstrate how the results support or refute the above hypotheses.
Statistical consultant, cfair13@uga.edu